NumPy is a library for working with large arrays and matrices of numeric data. Its array operations are implemented in optimized compiled code, so they typically run much faster than equivalent loops over Python lists.
At the foundation of NumPy is the array object class. NumPy arrays are like Python lists, except that every element in an array must be of the same type.
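As a small illustration of the same-type rule, the sketch below builds an array from mixed inputs and inspects its `dtype` attribute. The variable names `mixed` and `as_float` are made up for this example.

```python
import numpy as np

# Mixing ints and a float: NumPy stores every element as one common type.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# Passing an explicit type (here float) converts the integer inputs.
as_float = np.array([1, 2, 3], float)
print(as_float.dtype)
```

Both arrays end up holding floating-point values, even though most of the inputs were written as integers.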
In [3]:
import numpy as np
array = np.array([1,2,3,4,5],float)
print array
matrix = np.array([[1,2,3,4],[5,6,7,8]],float)
print matrix
You can index, slice, and manipulate a NumPy array much like you do a Python list.
In [4]:
print array[1]
print array[:2]
print array[3:]
In [5]:
print matrix[1][1]
print matrix[1,:]
print matrix[:,1]
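The two-dimensional indexing forms used above can be compared side by side. This is only a sketch; the names `same`, `second_row`, and `second_col` are introduced here for illustration.

```python
import numpy as np

matrix = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], float)

# matrix[1][1] first extracts row 1, then indexes into that row;
# matrix[1, 1] reaches the same element in a single indexing step.
same = matrix[1][1] == matrix[1, 1]
print(same)

second_row = matrix[1, :]  # every column of row 1
second_col = matrix[:, 1]  # column 1 of every row
print(second_row)
print(second_col)
```

The comma form is generally preferred for NumPy arrays because it avoids creating the intermediate row.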
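You can perform arithmetic operations on NumPy arrays. The operators +, -, *, and / work element by element, so matrix1 * matrix2 below multiplies corresponding entries rather than computing a matrix product.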
In [6]:
array1 = np.array([1,2],float)
array2 = np.array([3,4],float)
print array1 + array2
print array1 - array2
print array1 * array2
print array1 / array2
In [7]:
matrix1 = np.array([[1,2],[3,4]],float)
matrix2 = np.array([[5,6],[7,8]],float)
print matrix1 + matrix2
print matrix1 - matrix2
print matrix1 * matrix2
print matrix1 / matrix2
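To make the difference between `*` and a true matrix product concrete, here is a minimal sketch that computes both for the same two matrices. The names `elementwise` and `matrix_product` are chosen for this example.

```python
import numpy as np

matrix1 = np.array([[1, 2], [3, 4]], float)
matrix2 = np.array([[5, 6], [7, 8]], float)

# * multiplies matching entries: 1*5, 2*6, 3*7, 4*8
elementwise = matrix1 * matrix2

# np.dot combines rows of matrix1 with columns of matrix2,
# e.g. the top-left entry is 1*5 + 2*7
matrix_product = np.dot(matrix1, matrix2)

print(elementwise)
print(matrix_product)
```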
In addition to arithmetic operations, NumPy provides a number of other mathematical operations you can apply (e.g., mean, median, standard deviation, dot product).
In [8]:
# Functions useful for statistical analysis
array1 = np.arange(1,6)
print array1
print np.mean(array1)
print np.median(array1)
print np.std(array1)
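One detail worth knowing about `np.std`: by default it divides by N (the population standard deviation). Passing `ddof=1` divides by N-1 instead, which gives the sample standard deviation. A short sketch, with variable names chosen for this example:

```python
import numpy as np

array1 = np.arange(1, 6)  # 1, 2, 3, 4, 5

# Squared deviations from the mean (3) are 4, 1, 0, 1, 4, which sum to 10.
population_std = np.std(array1)      # sqrt(10 / 5)
sample_std = np.std(array1, ddof=1)  # sqrt(10 / 4)

print(population_std)
print(sample_std)
```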
In [9]:
# Dot product
array2 = np.arange(3,8)
print array1
print array2
print(np.dot(array1,array2))
In [10]:
# Dot product (matrix)
oneDArray= np.array([1,2])
twoDArray = np.array([[2,4,6],[3,5,7]])
print np.dot(oneDArray,twoDArray)
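The one-dimensional by two-dimensional dot product above can be checked by hand: each output entry pairs `oneDArray` with one column of `twoDArray`. The sketch below spells out that arithmetic; `by_hand` and `result` are names introduced for this example.

```python
import numpy as np

oneDArray = np.array([1, 2])
twoDArray = np.array([[2, 4, 6], [3, 5, 7]])

# One sum of products per column of twoDArray
by_hand = [1*2 + 2*3, 1*4 + 2*5, 1*6 + 2*7]
result = np.dot(oneDArray, twoDArray)

print(by_hand)
print(result)
```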
In [11]:
from pandas import DataFrame, Series
In [12]:
# Create a dictionary in which each key is a column name and each value is the corresponding Series.
# For each Series, pass the data first, then the index labels that say where each value goes.
#
# You can think of a Series as a one-dimensional object that is similar to an array,
# list, or column in a database. By default, it assigns an index label to each
# item in the Series, ranging from 0 to N-1, where N is the number of items in the Series.
#
data = {'name':Series(['Braund','Cummings','Heikkinen','Allen'],index = ['a','b','c','d']),
'age':Series([22,38,26,35],index = ['a','b','c','d']),
'Fare':Series([7.25,71.83,8.05],index = ['a','b','d']),
'Survived':Series([False,True,True,False],index = ['a','b','c','d'])}
In [13]:
# Pass data as an argument to the DataFrame function to create the actual data frame
df = DataFrame(data)
print df
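Note that the 'Fare' Series above has no value for index label 'c'. When Series with different indexes are combined, pandas aligns them by label and fills the gaps with NaN. A reduced sketch of that behavior, using only two of the columns and a made-up name `small_df`:

```python
from pandas import DataFrame, Series

data = {'age': Series([22, 38, 26, 35], index=['a', 'b', 'c', 'd']),
        'Fare': Series([7.25, 71.83, 8.05], index=['a', 'b', 'd'])}
small_df = DataFrame(data)

# 'Fare' has no entry for label 'c', so that cell is missing (NaN).
missing_fare = small_df['Fare'].isnull()
print(missing_fare)
```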
In [14]:
# You can operate on specific columns by calling them as if they were a key in a dictionary.
# You can access one column. When you do, you get a Series
print df['name']
In [15]:
# You can access multiple columns by passing list of column names.
# When you do, you get back a dataframe.
df[['name','age','Survived']]
Out[15]:
In [16]:
print df
# Select rows in which the passenger age >= 30
print df[df['age']>= 30]
In [17]:
# You can also perform the above operation on a particular column
# Example: Get 'Survived' information for passengers whose age >= 30
# Notes:
# df['Survived'] picks out only the data in the 'Survived' column
# df['Survived'][df['age'] >= 30] keeps the entries for which the condition df['age'] >= 30 is true.
print df
print df['Survived'][df['age'] >= 30]
You can access rows in multiple ways.
You can also combine multiple selection conditions with boolean operators such as & (and) and | (or).
In [18]:
print df
# Get row corresponding to passenger "Braund", whose index is a
print df.loc['a']
print ""
# You can also access via integer position
print df.iloc[[1]]
In [19]:
print df
# Example: find the passengers with age >= 30
print df[df['age']>= 30]
In [20]:
print df
print df[0:2]
In [21]:
print df
# Multiple selection using & (and)
print df[(df.Survived == True) & (df.age > 30)]
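The `|` (or) operator works the same way as `&`. The sketch below rebuilds two columns of the passenger data and selects rows that satisfy either condition. Each comparison needs its own parentheses because `&` and `|` bind more tightly than `==` and `>`. The names `passengers`, `both`, and `either` are introduced for this example.

```python
from pandas import DataFrame, Series

idx = ['a', 'b', 'c', 'd']
passengers = DataFrame({'age': Series([22, 38, 26, 35], index=idx),
                        'Survived': Series([False, True, True, False], index=idx)})

# Rows where both conditions hold
both = passengers[(passengers['Survived'] == True) & (passengers['age'] > 30)]

# Rows where at least one condition holds
either = passengers[(passengers['Survived'] == True) | (passengers['age'] > 30)]

print(both)
print(either)
```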
It is possible to perform boolean indexing on specific columns.
Pandas also has various functions that help you understand some basic information about your dataframe. Some of these functions are:
In [22]:
print df.dtypes
print ""
print df.describe()
In [23]:
# First, let's create a dataframe
d = {'one': Series([1,2,3], index = ['a','b','c']),
'two': Series([1,2,3,4], index = ['a','b','c','d'])}
df = DataFrame(d)
print df
In [24]:
# Second, we apply an arbitrary function to every column
# in the dataframe using df.apply.
# Here the function returns one value per column, so the result is a Series
# Example: apply numpy.mean to every column (axis = 0 for column operation, axis = 1 for row operation)
df.apply(np.mean, axis = 0)
Out[24]:
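To see what the axis argument changes, the following sketch applies the same function along both axes of the dataframe defined above. The names `col_means` and `row_means` are chosen for this example.

```python
import numpy as np
from pandas import DataFrame, Series

d = {'one': Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = DataFrame(d)

col_means = df.apply(np.mean, axis=0)  # one value per column, indexed by column name
row_means = df.apply(np.mean, axis=1)  # one value per row, indexed by row label

print(col_means)
print(row_means)
```

With axis = 0 the result is indexed by 'one' and 'two'; with axis = 1 it is indexed by 'a' through 'd'.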
In [25]:
# You can apply map to a column in the dataframe, or applymap to the entire dataframe.
# These functions let you apply a function that takes in a single value and returns
# a single value
# Example
print df['one']
# Go through every value in 'one' and apply the lambda function
print df['one'].map(lambda x: x> 1)
# Refresher:
# Lambda functions are small inline functions that are
# defined on-the-fly in Python. lambda x: x > 1 will take an input x and return x > 1,
# or a boolean that equals True or False.
# For more info go to: https://docs.python.org/2/tutorial/controlflow.html#lambda-expressions
In [26]:
# The function can be applied to all the columns of the dataframe
print df
print df.applymap(lambda x: x>1)
Note: map() can only be used on a Series to return a new Series and applymap() can only be used on a DataFrame to return a new DataFrame.
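For simple comparisons like the one above, the element-by-element `map` call has a vectorized equivalent: applying the comparison operator to the whole Series at once. The sketch below is an illustration under that assumption, with names (`one`, `mapped`, `vectorized`) made up for this example.

```python
from pandas import Series

one = Series([1, 2, 3], index=['a', 'b', 'c'])

mapped = one.map(lambda x: x > 1)  # calls the lambda once per element
vectorized = one > 1               # same comparison applied to the whole Series

print(mapped)
print(vectorized)
```

`map` remains the more general tool, since it accepts any single-value function, while the operator form only covers what pandas can express directly.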
In [27]:
'''
Compute the average of bronze medals earned by countries who earned at least one gold medal.
Save this to a variable called: avg_bronze_at_least_one_gold
HINT-1:
You can retrieve all of the values of a Pandas column from a
data frame, "df", as follows:
df['column_name']
HINT-2:
The numpy.mean function can accept as an argument a single
Pandas column.
For example, numpy.mean(df["col_name"]) would return the
mean of the values located in "col_name" of a dataframe df.
'''
countries = ['Russian Fed.', 'Norway', 'Canada', 'United States',
'Netherlands', 'Germany', 'Switzerland', 'Belarus',
'Austria', 'France', 'Poland', 'China', 'Korea',
'Sweden', 'Czech Republic', 'Slovenia', 'Japan',
'Finland', 'Great Britain', 'Ukraine', 'Slovakia',
'Italy', 'Latvia', 'Australia', 'Croatia', 'Kazakhstan']
gold = [13, 11, 10, 9, 8, 8, 6, 5, 4, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
silver = [11, 5, 10, 7, 7, 6, 3, 0, 8, 4, 1, 4, 3, 7, 4, 2, 4, 3, 1, 0, 0, 2, 2, 2, 1, 0]
bronze = [9, 10, 5, 12, 9, 5, 2, 1, 5, 7, 1, 2, 2, 6, 2, 4, 3, 1, 2, 1, 0, 6, 2, 1, 0, 1]
olympic_medal_counts = {'country_name':Series(countries),
'gold': Series(gold),
'silver': Series(silver),
'bronze': Series(bronze)}
df = DataFrame(olympic_medal_counts)
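# Select the 'bronze' column first, then keep the rows with at least one gold medal
avg_bronze_at_least_one_gold = np.mean(df['bronze'][df['gold']>0])
# Equivalent: filter the rows first, then select the 'bronze' column
np.mean(df[df['gold']>0]['bronze'])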
Out[27]:
Using the dataframe's apply method, create a new Series called avg_medal_count that indicates the average number of gold, silver, and bronze medals earned among countries that earned at least one medal of any kind at the 2014 Sochi Olympics.
In [28]:
# Original data frame
print df
In [29]:
# First: Get subset of dataframe that only contains medal information
df1 = df[['gold','silver','bronze']]
print df1
In [30]:
# Second: Keep rows in which there is at least 1 medal
df2 = df1[(df1['gold']>0) | (df1['silver']>0) | (df1['bronze']>0)]
print df2
In [31]:
# Third: Get average number of gold, silver, and bronze
avg_medal_count = Series(df2.apply(np.mean))
print avg_medal_count
Imagine a point system in which each country is awarded 4 points for each gold medal, 2 points for each silver medal, and 1 point for each bronze medal.
Using the numpy.dot function, create a new dataframe that includes: (a) a column called 'country_name' with the country name, and (b) a column called 'points' with the total number of points the country earned at the Sochi Olympics.
In [33]:
# Create an array with point system
awards = np.array([4,2,1])
data = {'country_name':df['country_name'],
'points':Series(df[['gold','silver','bronze']].apply(lambda x:np.dot(awards,x),axis = 1))}
newDF = DataFrame(data)
print newDF
In [37]:
dp_function = lambda x: np.dot(awards, x)
data = {'country_name':df['country_name'],
'points':Series(df[['gold','silver','bronze']].apply(dp_function,axis = 1))}
newDF = DataFrame(data)
print newDF
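As an alternative to calling `np.dot` once per row through `apply`, the whole medal table can be multiplied by the weight vector in a single call. The sketch below uses only the first three countries from the data above; the names `medals`, `points`, and `result` are introduced here and are not part of the original exercise.

```python
import numpy as np
from pandas import DataFrame

# First three rows of the medal table, copied from the lists above
medals = DataFrame({'country_name': ['Russian Fed.', 'Norway', 'Canada'],
                    'gold': [13, 11, 10],
                    'silver': [11, 5, 10],
                    'bronze': [9, 10, 5]})
awards = np.array([4, 2, 1])

# (3 x 3 medal matrix) dot (length-3 weight vector) gives one total per country
points = np.dot(medals[['gold', 'silver', 'bronze']].values, awards)
result = DataFrame({'country_name': medals['country_name'], 'points': points})
print(result)
```

For example, the first row works out to 13*4 + 11*2 + 9*1 = 83 points.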